@sahilkumarsingh
Contributor

What changes were proposed in this pull request?

This PR addresses SPARK-54634.

It adds a user-friendly error message for SQL queries that contain an empty IN clause, such as: SELECT * FROM table WHERE col IN ()

Why are the changes needed?

When a query contains an empty IN clause, Spark currently raises a generic [PARSE_SYNTAX_ERROR] that points at the IN keyword. This leads users to believe that their syntax is wrong in some unspecified way, when the actual problem is that the IN list contains no values. The current error message therefore does not tell the user what to fix.

This change provides a clear, actionable error message that explains the actual problem and suggests alternatives.

Example - Before:

org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'IN'. SQLSTATE: 42601 (line 1, pos 33)

Example - After:

org.apache.spark.sql.catalyst.parser.ParseException:
[INVALID_SQL_SYNTAX.EMPTY_IN_PREDICATE] Invalid SQL syntax: IN predicate requires at least one value. Empty IN clauses like 'IN ()' are not allowed. Consider using 'WHERE FALSE' if you need an always-false condition, or provide at least one value in the IN list. SQLSTATE: 42000
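
The advice in the new message can also be applied on the caller's side when a query is assembled from a list that may be empty at runtime. The following is a minimal sketch, not part of this PR: `InPredicate.render` is a hypothetical helper, it only handles integer literals, and it assumes the column name comes from trusted code rather than user input.

```scala
// Hypothetical helper, not part of Spark or of this PR.
// Renders an IN predicate over integer literals and falls back to FALSE
// when there are no values, so the generated SQL never contains "IN ()".
object InPredicate {
  def render(column: String, values: Seq[Long]): String =
    if (values.isEmpty) "FALSE" // an empty list can never match a row
    else values.mkString(s"$column IN (", ", ", ")")
}
```

With this sketch, `InPredicate.render("id", Seq(1L, 2L, 3L))` yields `id IN (1, 2, 3)`, and an empty sequence yields `FALSE`, which matches the alternative suggested in the new error message.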

Does this PR introduce any user-facing change?

Yes, users will now see a better error message.

Code executed: spark.sql("SELECT * FROM range(10) WHERE id IN ()").show()

Before output: [screenshot]

After output: [screenshot]

How was this patch tested?

  • Added unit tests in QueryParsingErrorsSuite.scala and SQL golden tests in predicate-functions.sql.
  • Tested the build locally by running the query in spark-shell.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic) - used for code assistance, test generation, and documentation.

@github-actions github-actions bot added the SQL label Dec 8, 2025
Contributor

@allisonwang-db allisonwang-db left a comment

Thanks for making the error message better!

  exception = parseException(sql2),
  condition = "PARSE_SYNTAX_ERROR",
- parameters = Map("error" -> "'IN'", "hint" -> ""))
+ parameters = Map("error" -> "'INTO'", "hint" -> ""))
Contributor

What's the error message before and after this change for this test case?

Contributor Author

Hey Allison,

Here is the output before and after this change for this test case:

Before:

scala> spark.sql("SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))").show()
org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'IN'. SQLSTATE: 42601 (line 1, pos 25)

== SQL ==
SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))
-------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:285)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:97)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:93)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$5(SparkSession.scala:492)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:491)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
  ... 42 elided

After:

scala> spark.sql("SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))").show()
org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'INTO'. SQLSTATE: 42601 (line 1, pos 36)

== SQL ==
SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))
------------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:267)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:78)
  at org.apache.spark.sql.execution.SparkSqlParser.super$parse(SparkSqlParser.scala:163)
  at org.apache.spark.sql.execution.SparkSqlParser.$anonfun$parseInternal$1(SparkSqlParser.scala:163)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:107)
  at org.apache.spark.sql.execution.SparkSqlParser.parseInternal(SparkSqlParser.scala:163)
  at org.apache.spark.sql.execution.SparkSqlParser.parseWithParameters(SparkSqlParser.scala:70)
  at org.apache.spark.sql.execution.SparkSqlParser.parsePlanWithParameters(SparkSqlParser.scala:84)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$6(SparkSession.scala:573)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:572)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:563)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:591)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:682)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:92)
  ... 42 elided

Contributor Author

Hey @allisonwang-db , could you check this output and let me know, thanks!

Contributor

@allisonwang-db allisonwang-db left a comment

Thanks for the fix. Much better error message.

@sahilkumarsingh
Contributor Author

Thanks for approving the changes, @allisonwang-db. Do you happen to know when this PR might be merged?

@allisonwang-db
Contributor

cc @cloud-fan

errorClass = "INVALID_SQL_SYNTAX.EMPTY_IN_PREDICATE",
messageParameters = Map(
"alternative" -> ("Consider using 'WHERE FALSE' if you need an always-false condition, " +
"or provide at least one value in the IN list.")),
Contributor

Why pass the alternative as an error parameter instead of just putting it in the error message template?

Contributor Author

Looking back, it's quite possible to directly include this alternative in the error message template. Shall I make this change?

Contributor

yes please

Contributor Author

Done, please check.
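
For readers following this thread, the difference between the two designs can be sketched as below. This is a simplified, hypothetical model and not Spark's error framework: `ErrorTemplateSketch` and its `render` function are made up for illustration, and the sketch assumes templates mark parameters with angle-bracket placeholders such as `<alternative>`.

```scala
// Hypothetical, simplified model of message templates; not Spark's implementation.
object ErrorTemplateSketch {
  // Replace every <name> placeholder with the matching value from params.
  def render(template: String, params: Map[String, String]): String =
    params.foldLeft(template) { case (msg, (name, value)) =>
      msg.replace(s"<$name>", value)
    }

  private val advice =
    "Consider using 'WHERE FALSE' if you need an always-false condition, " +
      "or provide at least one value in the IN list."

  // Option 1: the caller supplies the advice as a message parameter.
  val parameterized: String = render(
    "IN predicate requires at least one value. <alternative>",
    Map("alternative" -> advice))

  // Option 2: the advice is part of the template, so the caller passes nothing.
  val fixed: String = render(
    s"IN predicate requires at least one value. $advice",
    Map.empty)
}
```

Both options produce the same text. Because the advice never varies between call sites, keeping it in the template leaves the wording in one place and removes a parameter that every caller would otherwise have to repeat.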

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9c67509 Jan 8, 2026